Copy, please, these files and directories to your personal directory:
cp -r /data/shared/AGE2020/Exercises/E03-qPCR ~/AGE2020/Exercises
And switch the R working directory to the current exercise: setwd("~/AGE2020/Exercises/E03-qPCR")
Quantitative real-time PCR monitors the amplification of a targeted DNA molecule during the PCR (i.e., in real time), not at its end, as in conventional PCR.
In real time reverse transcription quantitave PCR (RT-qPCR), we measure the amplification of target reverse transcribed DNA molecules. This amplification depends on initial RNA concentration and approximates the gene expression. We select a fluorescence signal threshold \(S_T\) and measure crossing points \(C_P\) (also called threshold cycle \(C_T\)) for samples. To correct for initial RNA concentration and quality, reference genes are used (dashed curves). These genes are considered to be equally expressed in every condition (purple and orange curves), and so serve as the baseline. First, we normalize each sample to reference gene (or they geometric mean) by calculating \(\Delta C_t = C_P^r - C_P^t\). Obtained \(\Delta C_t\) values are then used to calculate the relative change of gene expression between samples. In this case, we are interested how much more or less is gene expressed in \(\text{treatment}\) sample (\(\text{trt}\)), relative to \(\text{control}\) (\(\text{cnt}\)) sample. This relative change is called \(\text{fold-change}\): \(FC = 2^{-\Delta\Delta C_P} = 2^{-(\Delta C_{P, trt} - \Delta C_{P, cnt})}\). We usually report \(FC\) in \(log_2\) scale: \(LFC = \Delta C_{P, cnt} - \Delta C_{P, trt} = log_2(FC)\) Important is to realize that the higher \(C_P\) is, the lower initial mRNA concentration was. This is why we put minus /\(-\)/ in front of \(\Delta\Delta C_t\). We usually don’t consider the efficiency of PCR and expect the mRNA concentration to double in each cycle. Otherwise \(FC = (1+E)^{-\Delta\Delta C_P}\), where \(E\) is the PCR efficiency.
This will be an introductory exercise in which you will refresh your R skills and implement (mainly visualization) functions, which you will use later on different types of gene expression data.
We will work with data from pancreatic tumour experiment, where an influence of spirulina algae was investigated. In this experiment there are two sample groups:
Each sample group has three biological replicates (grown on different Petri dish). In addition, each biological replicate is cultivated for 50% and 90% confluence, and for each confluence there are two technical replicates. Uhhh…
But we can expand it to:
- control group
|
|-- biological replicate #1 (Petri dish #1)
| |-- confluence 50
| | |-- technical replicate #1 -> sample C1_C50_R1
| | |-- technical replicate #2 -> sample C2_C50_R2
| |-- confluence 90
| |-- technical replicate #1 -> sample C1_C90_R1
| |-- technical replicate #2 -> sample C1_C90_R2
|-- biological replicate #2 (Petri dish #2)
|-- confluence 50
| |-- technical replicate #1 -> sample C2_C50_R1
| |-- technical replicate #2 -> sample C2_C50_R2
|-- confluence 90
|-- ...
- spirulina group
|
|-- biological replicate #1 (Petri dish #1)
| |-- confluence 50
| | |-- technical replicate #1 -> sample S1_C50_R1
| | |-- technical replicate #2 -> sample S1_C50_R2
| |-- confluence 90
| |-- technical replicate #1 -> sample S1_C90_R1
| |-- technical replicate #2 -> sample S1_C90_R2
|-- biological replicate #2 (Petri dish #2)
|-- confluence 50
| |-- technical replicate #1 -> sample S2_C50_R1
| |-- technical replicate #2 -> sample S2_C50_R2
|-- confluence 90
|-- ...
If you want, we can draw it on the board :)
Also, there are two groups of measured genes:
library(dplyr)
library(stringr)
library(tidyr)
library(glue)
library(magrittr)
This will be your personal library in which you implement the tasks. Now it contains only function skeletons. These skeletons are not obligatory, but your functions should have output similar to what you will see in this tutorial.
Note that your functions will be universal, serving not only for qPCR data. This is mainly true for exploratory analysis functions (hiearchical clustering, PCA, etc.), where input (expression matrix) is same also for microarrays and RNA-seq data.
source("../age_library.R")
I would recommend you to always define constants at the beginning of a script. It is more transparent, and if you, for example, change some path, you haven’t to replace it in whole source code. Constants are, by convention, named in UPPERCASE.
CP_DATA <- "qPCR.Rds"
We have already preprocessed a \(C_P\) matrix for you. But if you really want to start from the floor, you can create the \(C_P\) matrix from original files (here and here).
In \(C_P\) matrix, columns are samples, rows are genes, and values are \(C_P\) values. This format is generally used for gene expression data and you will see it also in case of microarrays and RNA-Seq.
cp <- readRDS(CP_DATA) %>%
as.data.frame()
genes <- rownames(cp)
samples <- colnames(cp)
cp[1:5, 1:5]
CP values >= 40 are considered out of qPCR limit:
cp[cp > 39.9] <- NA
Some genes have so many NA values:
lq_genes <- rowSums(!is.na(cp))
lq_genes
## CD24 CD31 CXCL12 CXCR4 CXCR7 FLT1 VEGFA VEGRF2 ACTB H4 GAPDH H5 RPS9 H11 TBP H26
## 24 19 2 10 0 0 24 2 24 24 24 24
Let’s remove them:
lq_genes <- names(lq_genes[lq_genes <= 2])
genes <- setdiff(genes, lq_genes)
cp <- cp[genes, ]
Now we split genes to target and reference ones. Latter have H on the end of their names:
target_genes <- genes[str_detect(genes, ".+ H", negate = TRUE)]
ref_genes <- setdiff(genes, target_genes)
gene_groups <- if_else(str_detect(genes, " H\\d+$"), "reference", "target")
(genes_df <- data.frame(
gene = as.character(genes),
gene_group = factor(gene_groups, levels = c("reference", "target")),
row.names = genes,
stringsAsFactors = FALSE)
)
For plotting purposes, we replace NAs by 40:
cp_plot <- replace(cp, is.na(cp), 40)
You usually obtain a sample sheet, but in this case we are going to parse the column names of \(C_P\) matrix. Everything needed is contained there:
(sample_names <- colnames(cp))
## [1] "K1/50" "K1/50(2RT)" "K1/90" "K1/90 (2RT)" "K2/50" "K2/50(2RT)" "K2/90" "K2/90 (2RT)" "K3/50" "K3/50(2RT)" "K3/90"
## [12] "K3/90 (2RT)" "SP1/50" "SP1/50 (2RT)" "SP1/90" "SP1/90 (2RT)" "SP2/50" "SP2/50 (2RT)" "SP2/90" "SP2/90 (2RT)" "SP3/50" "SP3/50(2RT)"
## [23] "SP3/90" "SP3/90 (2RT)"
pheno_data <- data.frame(
sample_id = sample_names,
sample_group_code = str_match(sample_names, "^SP|K")[, 1] %>% recode_factor("K" = "c", "SP" = "sp"),
petri_number = str_match(sample_names, "^[A-Z]{1,2}(\\d)")[, 2] %>% as.factor(),
confluence = str_match(sample_names, "\\/(\\d{2})")[, 2] %>% as.factor(),
replicate = str_match(sample_names, "\\((\\d)RT\\)")[, 2] %>% replace_na("1") %>% as.factor()
) %>%
dplyr::mutate(
sample_name_rep = glue("{sample_group_code}{petri_number}_{confluence}_r{replicate}") %>% as.character(),
sample_name = glue("{sample_group_code}{petri_number}_{confluence}") %>% as.character(),
sample_group = recode_factor(sample_group_code, "c" = "control", "sp" = "spirulina")
) %>%
dplyr::select(sample_name_rep, sample_name, everything()) %>%
set_rownames(.$sample_name_rep)
colnames(cp) <- rownames(pheno_data)
colnames(cp_plot) <- rownames(pheno_data)
samples <- colnames(cp)
head(pheno_data)
We also create long data from cp, pheno_data and genes_df.
cp_long <- as.data.frame(cp) %>%
tibble::rownames_to_column("gene") %>%
tidyr::pivot_longer(-gene, names_to = "sample_name_rep", values_to = "cp") %>%
dplyr::left_join(pheno_data, by = "sample_name_rep") %>%
dplyr::left_join(genes_df, by = "gene") %>%
dplyr::mutate(
cp_plot = if_else(is.na(cp), 40, cp)
) %>%
dplyr::select(sample_name_rep, sample_name, gene, cp, cp_plot, gene_group, everything())
head(cp_long)
We are using exploratory analysis to have an unbiased first look at data. It is also a crucial step to assess the biological quality control, e.g. whether samples from different groups create separate clusters - if not, samples might be swapped.
TASK 1: Implement your own function which takes an expression matrix and returns a dendrogram with coloring by a chosen variable.
color_variables <- c("sample_group", "confluence", "petri_number")
for (variable in color_variables) {
plot_hc(
cp_plot, color_by = pheno_data[, variable],
method_distance = "euclidean", method_clustering = "complete",
color_by_lab = variable
)
}
TASK 2: Implement a function for PCA visualization with point coloring by a chosen variable.
plot_pca(cp_plot, color_by = pheno_data$sample_group, color_by_lab = "sample_group")
ggplot2, but be prepared to use data in long format:plot_pca_ggplot2(cp_plot, pheno_data, color_by = "sample_group", shape_by = "confluence")$plot
plot_pca_ggplot2(cp_plot, pheno_data, color_by = "sample_group", shape_by = "confluence", label_by = "sample_name_rep")$plot
plot_pca_ggplot2(cp_plot, pheno_data, color_by = "sample_group", shape_by = "petri_number")$plot
GGally::ggpairs() to plot a grid of PCA plots (PC1 vs. PC2, PC1 vs. PC3, PC2 vs. PC3, etc):plot_pca_ggpairs(cp_plot, pheno_data, n_components = 5, color_by = "sample_group", shape_by = "confluence")$plot
TASK 3: Implement your own function for heatmap.
You can use a nice package heatmaply which produces interactive HTML heatmaps. Or pheatmap for a better alternative to heatmap.2().
plot_heatmap(as.matrix(cp_plot), color_by = pheno_data$sample_group, color_legend_lab = "Sample group", main = "Treatment (Cp values)")
plot_pheatmap(cp_plot, pheno_data, genes_df, column_color_by = color_variables, row_color_by = "gene_group", main = "qPCR")
plot_heatmaply(cp_plot, pheno_data, genes_df, column_color_by = color_variables, row_color_by = "gene_group", main = "qPCR", key.title = "CP")
Can you see any outlying samples (replicates)? If yes, should they be removed?
We will use housekeeping genes as reference genes. Housekeeping genes are genes which have similar expression profile regardless the cell type or state. That is why they can be used for sample normalization (as an internal control).
Expression of housekeeping genes is/should be correlated.
main_title <- "Raw data correlation"
t(cp) %>%
as.data.frame() %>%
GGally::ggpairs(progress = FALSE) +
theme_bw()
You can also use base R:
pairs(t(cp))
Let’s look more closely at correlation coefficients:
(gene_cor <- cor(t(cp), use = "pairwise.complete.obs"))
## CD24 CD31 CXCR4 VEGFA ACTB H4 GAPDH H5 RPS9 H11 TBP H26
## CD24 1.00000000 0.03970592 -0.28809768 0.73649418 0.73547823 0.73141005 0.76319246 0.72419041
## CD31 0.03970592 1.00000000 -0.73606175 0.03747391 0.03172092 0.17532289 0.10746231 0.08438151
## CXCR4 -0.28809768 -0.73606175 1.00000000 -0.20149851 -0.18053811 -0.02185175 0.08141219 -0.25461239
## VEGFA 0.73649418 0.03747391 -0.20149851 1.00000000 0.77468336 0.82725327 0.83395485 0.77534719
## ACTB H4 0.73547823 0.03172092 -0.18053811 0.77468336 1.00000000 0.92233124 0.89743784 0.93865311
## GAPDH H5 0.73141005 0.17532289 -0.02185175 0.82725327 0.92233124 1.00000000 0.96305770 0.93496580
## RPS9 H11 0.76319246 0.10746231 0.08141219 0.83395485 0.89743784 0.96305770 1.00000000 0.90135469
## TBP H26 0.72419041 0.08438151 -0.25461239 0.77534719 0.93865311 0.93496580 0.90135469 1.00000000
ggcorrplot::ggcorrplot(gene_cor, method = "circle") +
ggtitle(main_title)
You can also use lattice:
lattice::levelplot(gene_cor, main = main_title)
Or heatmaply:
plot_heatmaply(gene_cor, feature_data = genes_df, row_color_by = "gene_group", main = main_title)
Housekeeping genes should cluster together. We can try both Euclidean and Pearson distance (1 - correlation coefficient) for hiearchical clustering.
plot_hc(gene_cor, color_by = gene_groups, color_by_lab = "Gene groups", method_distance = "euclidean", main = "Hierarchical clustering (Euclidean distance)")
plot_hc(gene_cor, color_by = gene_groups, color_by_lab = "Gene groups", method_distance = "pearson", main = "Hierarchical clustering (Pearson distance)")
PCA of correlation coefficients is also helpful:
plot_pca_ggplot2(gene_cor, genes_df, color_by = "gene_group")$plot
plot_pca_ggpairs(gene_cor, genes_df, color_by = "gene_group")$plot
TASK 4: Implement a function which computes the M-value.
All reference genes are OK:
for (g in ref_genes)
compute_m(g, cp_plot[ref_genes, ]) %>% glue("{g}: ", .) %>% cat(sep = "\n")
## ACTB H4: 0.587524037202414
## GAPDH H5: 0.675998845667576
## RPS9 H11: 0.73511830732382
## TBP H26: 0.545368225741302
Compare with target genes - they would not be good reference genes:
for (g in target_genes)
compute_m(g, cp_plot[target_genes, ]) %>% glue("{g}: ", .) %>% cat(sep = "\n")
## CD24: 1.62202887577501
## CD31: 2.50323654673845
## CXCR4: 2.18054789182113
## VEGFA: 1.59255501294409
For each sample we calculate a geometric mean of reference genes:
norm <- apply(cp[ref_genes, ], 2, function(x) {
log(x) %>% mean() %>% exp()
})
Let’s look how these means behave:
gene_cor_norm <- rbind(cp, norm = norm) %>%
t() %>%
cor(use = "pairwise.complete.obs")
genes_df_norm <- genes_df %>%
dplyr::mutate(gene_group = factor(gene_group, levels = c("reference", "target", "norm"))) %>%
rbind(norm = c("norm", "norm"))
ggcorrplot::ggcorrplot(gene_cor_norm, method = "circle") +
ggtitle(main_title)
plot_hc(gene_cor_norm, color_by = genes_df_norm$gene_group, color_by_lab = "Gene groups", method_distance = "euclidean", main = "Hierarchical clustering (Euclidean distance)")
plot_hc(gene_cor_norm, color_by = genes_df_norm$gene_group, color_by_lab = "Gene groups", method_distance = "pearson", main = "Hierarchical clustering (Pearson distance)")
plot_pca_ggplot2(gene_cor_norm, genes_df_norm, color_by = "gene_group")$plot
plot_pca_ggpairs(gene_cor_norm, genes_df_norm, color_by = "gene_group")$plot
We get \(\Delta C_P\) by subtracting the reference gene means:
(ref_gene_cp_matrix <- matrix(norm, ncol = ncol(cp), nrow = nrow(cp), byrow = TRUE))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19]
## [1,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [2,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [3,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [4,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [5,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [6,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [7,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [8,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [,20] [,21] [,22] [,23] [,24]
## [1,] 25.77045 23.89842 24.23634 24.82614 24.98597
## [2,] 25.77045 23.89842 24.23634 24.82614 24.98597
## [3,] 25.77045 23.89842 24.23634 24.82614 24.98597
## [4,] 25.77045 23.89842 24.23634 24.82614 24.98597
## [5,] 25.77045 23.89842 24.23634 24.82614 24.98597
## [6,] 25.77045 23.89842 24.23634 24.82614 24.98597
## [7,] 25.77045 23.89842 24.23634 24.82614 24.98597
## [8,] 25.77045 23.89842 24.23634 24.82614 24.98597
(delta_cp <- cp - ref_gene_cp_matrix)
Now we average technical replicates:
samples <- unique(pheno_data$sample_name)
delta_cp_avg <- matrix(NA, nrow = nrow(delta_cp), ncol = length(samples), dimnames = list(rownames(delta_cp), samples))
delta_cp_avg[1:5, 1:5]
## c1_50 c1_90 c2_50 c2_90 c3_50
## CD24 NA NA NA NA NA
## CD31 NA NA NA NA NA
## CXCR4 NA NA NA NA NA
## VEGFA NA NA NA NA NA
## ACTB H4 NA NA NA NA NA
(tech_rep <- split(colnames(delta_cp), pheno_data$sample_name))
## $c1_50
## [1] "c1_50_r1" "c1_50_r2"
##
## $c1_90
## [1] "c1_90_r1" "c1_90_r2"
##
## $c2_50
## [1] "c2_50_r1" "c2_50_r2"
##
## $c2_90
## [1] "c2_90_r1" "c2_90_r2"
##
## $c3_50
## [1] "c3_50_r1" "c3_50_r2"
##
## $c3_90
## [1] "c3_90_r1" "c3_90_r2"
##
## $sp1_50
## [1] "sp1_50_r1" "sp1_50_r2"
##
## $sp1_90
## [1] "sp1_90_r1" "sp1_90_r2"
##
## $sp2_50
## [1] "sp2_50_r1" "sp2_50_r2"
##
## $sp2_90
## [1] "sp2_90_r1" "sp2_90_r2"
##
## $sp3_50
## [1] "sp3_50_r1" "sp3_50_r2"
##
## $sp3_90
## [1] "sp3_90_r1" "sp3_90_r2"
for (s in samples)
delta_cp_avg[, s] <- rowMeans(delta_cp[, tech_rep[[s]], drop = FALSE], na.rm = TRUE)
# For control.
(rowMeans(delta_cp[, c("c1_50_r1", "c1_50_r2")], na.rm = TRUE) == delta_cp_avg[, "c1_50"]) %>%
all() %>%
stopifnot()
Final \(\Delta C_P\):
delta_cp_avg <- delta_cp_avg[target_genes, ]
delta_cp_avg_plot <- replace(delta_cp_avg, is.na(delta_cp_avg), ceiling(max(delta_cp_avg, na.rm = TRUE) + 1))
# Remove replicates.
pheno_data_avg <- dplyr::filter(pheno_data, replicate == 1) %>%
dplyr::select(-sample_name_rep, -sample_id, -replicate) %>%
dplyr::select(sample_name, everything()) %>%
set_rownames(.$sample_name)
cp_long <- dplyr::filter(cp_long, replicate == 1) %>%
dplyr::select(-sample_name_rep, -sample_id, -replicate) %>%
dplyr::select(sample_name, everything())
# Add delta_cp_avg
cp_long <- as.data.frame(delta_cp_avg) %>%
tibble::rownames_to_column("gene") %>%
tidyr::pivot_longer(-gene, names_to = "sample_name", values_to = "delta_cp") %>%
dplyr::right_join(cp_long, by = c("sample_name", "gene"))
# Add delta_cp_avg_plot
cp_long <- as.data.frame(delta_cp_avg_plot) %>%
tibble::rownames_to_column("gene") %>%
tidyr::pivot_longer(-gene, names_to = "sample_name", values_to = "delta_cp_plot") %>%
dplyr::right_join(cp_long, by = c("sample_name", "gene"))
cp_long_target <- dplyr::filter(cp_long, gene_group == "target")
You can check for outliers, as these could mess up a differential expression analysis.
plot_pca_ggplot2(delta_cp_avg_plot, pheno_data_avg, color_by = "sample_group", shape_by = "confluence")$plot
plot_pca_ggpairs(delta_cp_avg_plot, pheno_data_avg, n_components = 4, color_by = "sample_group", shape_by = "confluence")$plot
for (variable in color_variables) {
plot_hc(
delta_cp_avg_plot,
color_by = pheno_data_avg[, variable],
method_distance = "euclidean",
method_clustering = "complete",
color_by_lab = variable
)
}
plot_pheatmap(delta_cp_avg_plot, sample_data = pheno_data_avg, column_color_by = color_variables, main = "qPCR")
Now we calculate the relative change in gene expression between sample groups. For the sake of simplicity, we don’t take into account other variables, e.g. confluence. Also from the exploratory analysis you can see that these variables don’t have much effect on the separation of samples - sample group has the largest influence.
First, we calculate the \(mean(C_P)\) of control samples:
(control_means <- cp_long_target %>%
dplyr::filter(sample_group == "control") %>%
dplyr::group_by(gene) %>%
dplyr::summarise(delta_cp_control_mean = mean(delta_cp, na.rm = TRUE))
)
Then we join this table with our long data and calculate:
cp_long_target <- cp_long_target %>%
dplyr::left_join(control_means, by = "gene") %>%
dplyr::mutate(
delta_delta_cp = delta_cp - delta_cp_control_mean,
lfc = -delta_delta_cp,
fc = 2^lfc
) %>%
dplyr::select(sample_name, gene, cp, cp_plot, delta_cp, delta_cp_plot, delta_cp_control_mean, delta_delta_cp, lfc, fc, everything())
head(cp_long_target)
\(\text{mean}(\Delta \Delta C_{P, control})\) is now \(0\), because we subtracted \(\text{mean}(\Delta C_P, control)\) also from the control samples. Of course this also applies for \(LFC\), which is just \(-\Delta \Delta C_P\). As you will see, this is useful for plotting (boxplots).
Boxplots are very informative when you want to see a difference between groups.
TASK 5: Implement a function to produce a boxplot of values. It should work for any type of long data, as you will be able to specify the grouping column to split boxplots on x-axe, and column with values to plot on y-axe.
First we will plot the \(\Delta C_P\) values:
plot_boxplot_ggplot2(
cp_long_target,
x = "sample_group",
y = "delta_cp",
feature_col = "gene",
color_by = "sample_group",
main = "delta Cp values",
y_lab = "delta Cp"
)
And this is tricky to interpret at the first sight, because in spirulina group, all genes are actually upregulated! Also, you cannot directly see the \(LFCs\) - in your head, you have to calculate the difference between spirulina and control and flip the sign.
So better is to plot the \(LFCs\). Why not the \(\text{fold-changes}\)? Because \(LFCs\) are easier to read, as changes (relative to controls) are multiplicative and symmetric:
| \(\text{fold-change}\) | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 |
| \(log_2(\text{fold-change})\) | -3 | -2 | -1 | 0 | 1 | 2 | 3 |
| You read: Change is… | 8x less | 4x less | 2x less | no change | 2x more | 4x more | 8x more |
This figure summarises how \(LFCs\) were calculated and why it is better for boxplots :)
plot_boxplot_ggplot2(
cp_long_target,
x = "sample_group",
y = "lfc",
feature_col = "gene",
color_by = "sample_group",
main = "log2 fold-changes",
y_lab = "log2 fold-change"
)
We can also split plots by confluence:
for (g in target_genes) {
p1 <- plot_boxplot_ggplot2(
cp_long_target,
x = "sample_group",
y = "lfc",
feature_col = "gene",
feature = g,
main = "log2 fold-changes",
color_by = "sample_group",
y_lab = "log2 fold-change"
)
p2 <- plot_boxplot_ggplot2(
cp_long_target,
x = "sample_group",
y = "lfc",
feature_col = "gene",
feature = g,
main = "log2 fold-changes",
color_by = "sample_group",
facet_by = "confluence",
y_lab = "log2 fold-change"
)
p1_p2 <- cowplot::plot_grid(p1, p2, ncol = 2)
print(p1_p2)
}
t-test is testing whether two means, coming from samples having Student’s distribution, significantly differ. t-test assumes that the input \(C_P\) values are normally distributed and the variance between conditions is comparable. Wilcoxon test can be used when sample size is small and those two last assumptions are hard to achieve.
As we will use two-sided alternative (i.e. gene is up- or downregulated = deregulated), it doesn’t matter which one of \(\Delta C_P\), \(\Delta \Delta C_P\) or \(LFC\) we will use. In all cases, the absolute difference between group means remains the same:
group_means <- dplyr::filter(cp_long_target, gene == "CD24") %>%
dplyr::group_by(sample_group) %>%
dplyr::summarise(
delta_cp_mean = mean(delta_cp),
delta_delta_cp_mean = mean(delta_delta_cp),
lfc_mean = mean(lfc)
) %>%
as.data.frame() %>%
tibble::column_to_rownames("sample_group")
group_means
apply(group_means, MARGIN = 2, FUN = function(x) abs(x[1] - x[2]))
## delta_cp_mean delta_delta_cp_mean lfc_mean
## 1.057847 1.057847 1.057847
TASK 6: Implement a function which do the t-test of gene from two groups.
for (g in target_genes) {
test_results <- test_gene(
gene = g,
gene_data = cp_long_target,
gene_col = "gene",
value_col = "lfc",
group_col = "sample_group",
test = t.test,
verbose = TRUE
)
cat("\n")
}
## CD24: spirulina vs. control
## t.test p-value: 0.000959 ***
##
## CD31: spirulina vs. control
## t.test p-value: 0.00535 **
##
## CXCR4: spirulina vs. control
## t.test p-value: 0.168 NS
##
## VEGFA: spirulina vs. control
## t.test p-value: 0.00234 **
Or you can return a table for all genes:
(test_table <- test_gene_table(
gene_data = cp_long_target,
gene_col = "gene",
value_col = "delta_cp",
group_col = "sample_group"
))
save.image("qPCR.RData")
warnings()
traceback()
## No traceback available
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.7.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.7.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] friendlyeval_0.1.1 gplots_3.0.3 ggplot2_3.3.0 rlist_0.4.6.1 RColorBrewer_1.1-2 magrittr_1.5 tidyr_1.0.2 stringr_1.4.0 dplyr_0.8.4
## [10] glue_1.3.1 knitr_1.28 rmarkdown_2.1
##
## loaded via a namespace (and not attached):
## [1] viridis_0.5.1 httr_1.4.1 jsonlite_1.6.1 viridisLite_0.3.0 foreach_1.4.8 gtools_3.8.1 shiny_1.4.0 assertthat_0.2.1 yaml_2.2.1
## [10] ggrepel_0.8.1 pillar_1.4.3 digest_0.6.25 ggsignif_0.6.0 promises_1.1.0 colorspace_1.4-1 cowplot_1.0.0 htmltools_0.4.0 httpuv_1.5.2
## [19] plyr_1.8.6 pkgconfig_2.0.3 ggcorrplot_0.1.3 pheatmap_1.0.12 xtable_1.8-4 purrr_0.3.3 scales_1.1.0 webshot_0.5.2 gdata_2.18.0
## [28] later_1.0.0 tibble_2.1.3 farver_2.0.3 ggpubr_0.2.5 ellipsis_0.3.0 withr_2.1.2 lazyeval_0.2.2 mime_0.9 crayon_1.3.4
## [37] heatmaply_1.0.0 evaluate_0.14 GGally_1.4.0 MASS_7.3-51.5 ggthemes_4.2.0 tools_3.6.3 registry_0.5-1 data.table_1.12.8 lifecycle_0.2.0
## [46] plotly_4.9.2 munsell_0.5.0 cluster_2.1.0 packrat_0.5.0 compiler_3.6.3 caTools_1.18.0 rlang_0.4.5 grid_3.6.3 iterators_1.0.12
## [55] htmlwidgets_1.5.1 crosstalk_1.0.0 bitops_1.0-6 labeling_0.3 gtable_0.3.0 codetools_0.2-16 reshape_0.8.8 TSP_1.1-9 reshape2_1.4.3
## [64] R6_2.4.1 seriation_1.2-8 gridExtra_2.3 fastmap_1.0.1 KernSmooth_2.23-16 dendextend_1.13.4 stringi_1.4.6 Rcpp_1.0.3 vctrs_0.2.3
## [73] gclus_1.3.2 tidyselect_1.0.0 xfun_0.12
This chunk is not evaluated (eval = FALSE). Otherwise you will probably end up in recursive hell :)
library(rmarkdown)
library(knitr)
library(glue)
# You can set global chunk options. Options set in individual chunks will override this.
opts_chunk$set(warning = FALSE, message = FALSE)
render("qPCR.Rmd", output_file = "qPCR.html")